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Abstract 


We introduce Discriminative Bleu 
(ABleu), a novel metric for intrinsic 
evaluation of generated text in tasks that 
admit a diverse range of possible outputs. 
Reference strings are scored for quality 
by human raters on a scale of [—1, +1] 
to weight multi-reference Bleu. In tasks 
involving generation of conversational 
responses, ABleu correlates reasonably 
with human judgments and outperforms 
sentence-level and IBM Bleu in terms of 
both Spearman’s p and Kendall’s r. 


1 Introduction 


Many natural language processing tasks involve 
the generation of texts where a variety of outputs 
are acceptable or even desirable. Tasks with intrin¬ 
sically diverse targets range from machine transla¬ 
tion, summarization, sentence compression, para¬ 
phrase generation, and image-to-text to generation 
of conversational interactions. A major hurdle for 
these tasks is automation of evaluation, since the 
space of plausible outputs can be enormous, and 
it is it impractical to run a new human evaluation 
every time a new model is built or parameters are 
modified. 

In Statistical Machine Translation (SMT), the 
automation problem has to a large extent been ame¬ 
liorated by metrics such as Bleu (Papineni et al., 


2002) and Meteor (Banerjee and Lavie, 2005) 


Although Bleu is not immune from criticism (e.g., 
Callison-Burch et al. (2006)), its properties are well 
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understood, Bleu scores have been shown to cor¬ 


relate well with human judgments (Doddington 


2002[ Coughlin, 2003 |Graham and Baldwin, 2014 
Graham et al., 2015] ) in SMT, and it has allowed 
the field to proceed. 

Bleu has been less successfully applied to non- 
SMT generation tasks owing to the larger space 
of plausible outputs. As a result, attempts have 
been made to adapt the metric. To foster diversity 
in paraphrase generation, Sun and Zhou ( |2012[ ) 
propose a metric called iBLEU in which the Bleu 
score is discounted by a Bleu score computed be¬ 
tween the source and paraphrase. This solution, 
in addition to being dependent on a tunable pa¬ 
rameter, is specific only to paraphrase. In image 
captioning tasks, Vendantam et al. ( 2015| , employ 
a variant of Bleu in which n-grams are weighted 
by tf-idf. This assumes the availability of a coipus 
with which to compute tf-idf. Both the above can 
be seen as attempting to capture a notion of target 
goodness that is not being captured in Bleu. 

In this paper, we introduce Discriminative Bleu 
(ABleu), a new metric that embeds human judg¬ 
ments concerning the quality of reference sen¬ 
tences directly into the computation of corpus-level 
multiple-reference B LEU. In effect, we push part of 
the burden of human evaluation into the automated 
metric, where it can be repeatedly utilized. 

Our testbed for this metric is data-driven con¬ 
versation, a held that has begun to attract inter¬ 
est (Ritter et al ., 2011 ; |Sordoni et al., 2015] ) as an 
alternative to conventional rule-driven or scripted 
dialog systems. Intrinsic evaluation in this held 
is exceptionally challenging because the semantic 
space of possible responses resists definition and is 
only weakly constrained by conversational inputs. 

Below, we describe ABleu and investigate its 
characteristics in comparison to standard Bleu in 
the context of conversational response generation. 





























Context I'm on my way now. | 

Message [ I’ll be downstairs waiting. | 

Response j I'll keep an eye out for you. I 


Figure 1: Example of consecutive utterances of a dialog. 

We demonstrate that ABleu correlates well with 
human evaluation scores in this task and thus can 
provide a basis for automated training and evalua¬ 
tion of data-driven conversation systems—and, we 
ultimately believe, other text generation tasks with 
inherently diverse targets. 


data with the hope of finding distinct responses, 
but this solution is not feasible. Indeed, the larger 
the context, the less likely we are to find pairs that 
match exactly. Furthermore, while it is feasible to 
have writers create additional references when the 
downstream task is relatively unambiguous (e.g., 
MT), this approach is more questionable in the 
case of more subjective tasks such as conversa¬ 
tional response generation. Our solution is to mine 
candidate responses from conversational data and 
have judges rate the quality of these responses. Our 
new metric thus naturally incorporates qualitative 
weights associated with references. 


2 Evaluating Conversational Responses 3 Discriminative Bleu 


Given an input message m and a prior conversation 
history c, the goal of a response generation 
system is to produce a hypothesis h that is both 
well-formed and a pertinent response to message 
m (example in Fig.|Tj). We assume that a set of J 
references {rij} is available for the context c, and 
message to,;, where i £ {1 ... 1} is an index over 
the test set. In the case of BLEuQthe automatic 
score of the system output h\... hj is defined as: 


with: 


Bleu = BP • exp 



BP 


1 if t] > p 

e (t -p/n) otherwise 


( 1 ) 

( 2 ) 


where p and 77 are respectively hypothesis and refer¬ 
ence lengths^ Then corpus-level n-gram precision 
is defined as: 


Pn = 


£,;£ 


g E n-grams(/ij) 


max 




£,£ 


g E n-grams(Zii) 


#g{hi) 


where # 5 (-) is the number of occurrences of 
n-gram g in a given sentence, and # g (u,v) is a 
shorthand for min { 

It has been demonstrated that metrics such as 
Bleu show increased correlation with human judg¬ 


ment as the number of references increases (Przy- 
bocki et ah, 2008| Dreyer and Marcu, 2012]). Unfor¬ 
tunately, gathering multiple references is difficult 
in the case of conversations. Data gathered from 
naturally occurring conversations offer only one 
response per message. One could search (c, m) 
pairs that occur multiple times in conversational 


'Unless mentioned otherwise, Bleu refers to the original 
IBM Bleu as first described in iPapineni et al., 2002 '. 


2 In the case of multiple references, Bleu selects the refer¬ 
ence whose length is closest to that of the hypothesis. 


Discriminative B LEU, or AB LEU, extends B LEU 
by exploiting human qualitative judgments Wij £ 
[—1, +1] associated with references r,.y. It is dis¬ 
criminative in that it both rewards matches with 
“good” reference responses (w > 0 ) and penalizes 
matches with “bad” reference responses (w < 0 ). 
Formally, ABleu is defined as in Equation[l]and[2] 
except that p n is instead defined as: 

El Eg £ n-grams(fc,) maX Lg £ ^ {^ij ' #g(hj, rjj)} 
EiEg e „-grams(M max j { w ij ’ #9 ( h i) } 

In a nutshell, this is saying that each n-gram match 
is weighted by the highest scoring reference in 
which it occurs, and this weight can sometimes be 
negative. To ensure that the denominator is never 
zero, we assume that, for each i there exists at least 
one reference r,j whose weight Wij is strictly pos¬ 
itive. In addition to its discriminative nature, this 
metric has two interesting properties. First, if all 
weights Wij are equal to 1 , then the metric score is 
identical to Bleu. As such, ABleu admits Bleu 
as a special case. Second, as with IBM Bleu, the 
maximum theoretical score is also 1. If the hy¬ 
pothesis happens to match the highest weighted 
reference for each sentence, the numerator equals 
the denominator and the metric score becomes 1 . 
While we find this metric particularly appropriate 
for response generation, the metric makes no as¬ 
sumption on the task and is applicable to other text 
generation tasks such as MT and image captioning. 

4 Data 

4.1 Multi-reference Datasets 

To create the multi-reference Bleu dev and test 
sets used in this study, we adapted and extended the 

















Context c 

Message m 

Response r 

Score 

i was about to text you and my two cousins got 
excited cause they thought you were “rihanna” 

aww, i can imagine their disappointment 

they were very disappointed!!! 

0.6 

yes. my ex-boyfriend, killed my cat. like i say, 
it was the start of a bad time... 

i can imagine! 

yes. luckily, the whole thing feels very much of 
the past now. 

0.8 

its good., for some reason i can’t name stand 
out tracks but i’ve been playing it since it 
dropped 

i can imagine, banks doesn ’t disappoint 

na this is anything but a disappointment.. 

0.6 

at my lil cousins dancing to “dance for you 
these kids are a mess. 

Imaoo i can imagine. 

they were belly rolling, filarious. 

0.4 

what’s sick about it?? do you know how long it 
is?? no so how is it sick? 

i can imagine it 

your imagination is wrong, very wrong at that. 

-0.1 

it’s saad oh yeah the snow is very beautiful 

yeah i can imagine 

the weather in russia is very cool. 

-0.7 

walking outside with lightning and thunder 
sounds lOx scarier than when you ’re inside 

play in the rain 

haha i love listening to thunder! i would def be 
out there if i wasn’t so sleepy 

0.9 

also rain(s) 

no play, rain(s) 

i’m scared of the thunder 

0.5 

no beach or running today, stupid weather. 

lets play in the rain, “yolo” isnt that the 
vibe?? 

i would but its thundering to 

0.2 

then its raining... 

go outside and play in the rain loll 

if i had a buddy to go play in it with , i would. 
it’s no fun alone 

0.1 

my only official day off this week... now what 
to do 

play in the rain! 

yeah right haha. i wan na go swimming 

0.1 

now playing silence 

nice music i’m playing rain 

ohh i love that song 

-0.2 

yeah it is, i just dont see the point of the first 2 
laps 

me either it would be sick if there was like 5 
laps 

exactly and if there were more riders so they 
drafted and crashed each other 

0.3 

yippee how many 711 are there 

like 5! 

sweaarr i thought there were more 

0.8 

i ’ve got that muck off dry shower stuff to try out, 
hope it’s muddy now. 

i think if we go for it we cud get 5 laps in an 
hour! 

i’ve heard its a 30 min lap. but that was from a 
dh rider! 

0.6 

how much are they ? 

like $5 

i thought they were more then that but ok 

0.4 

igot you, wen iroll up ill pass that shit, iaint 
stingy. 

me either!! 

they more the marrier. 

-0.3 

i dont eat gravy on biscuits. 

me either. 

well then! why were the biscuits needed? 

-0.8 


Table 1: Sample reference sets created by our multi-reference extraction algorithm, along with the weights used in ABleu. 
Triples from which additional references are extracted are in italics. Boxed sentences are in our multi-reference dev set. 


methodology of Sordoni et al. (20151. From a cor¬ 
pus of 29M Twitter context-message-response con¬ 
versational triples, we randomly extracted approxi¬ 
mately 33K candidate triples that were then judged 
for conversational quality on a 5-point Likert-type 
scale by 3 crowdsourced annotators. Of these, 4232 
triples scored an average 4 or higher; these were 
randomly binned to create seed dev and test sets 
of 2118 triples and 2114 triples respectively. Note 
that the dev set is not used in the experiments of 
this paper, since ABleu and IBM Bleu are met¬ 
rics that do not require training. However, the dev 
set is released along with a test set in the dataset 
release accompanying this paper. 


We then sought to identify candidate triples in 
the 29M corpus for which both message and re¬ 
sponse are similar to the original messages and 
responses in these seed sets. To this end, we em¬ 
ployed an information retrieval algorithm with a 


bag-of-words BM25 similarity function (Robertson 
|et al., 1995] ), as detailed in Sordoni et al. ( 2015 ), 
to extract the top 15 responses for each message 


response pair. Unlike Sordoni et al. (2015), we 
further appended the original messages (as if par¬ 


roted back). The new triples were then scored for 
quality of the response in light of both context and 
message by 5 crowdsourced raters each on a 5- 
point Likert-type scalej^] Crucially, and again in 
contradistinction to Sordoni et al. ( 2015) , we did 
not impose a score cutoff on these synthetic multi¬ 
reference sets. Instead, we retained all candidate 
responses and scaled their scores into [—1, +1]. 

Table 1 presents representative multi-reference 
examples (from the dev set) together with their con¬ 
verted scores. The context and messages associated 
with the supplementary mined responses are also 
shown for illustrative puiposes to demonstrate the 
range of conversations from which they were taken. 
In the table, negative-weighted mined responses are 
semantically orthogonal to the intent of their newly 
assigned context and message. Strongly negatively 
weighted responses are completely out of the ball¬ 
park (“the weather in Russia is very cool”, “well 
then! Why were the biscuits needed?”); others are a 
little more plausible, but irrelevant or possibly topic 

3 For this work, we sought 2 additional annotations of the 
seed responses for consistency with the mined responses. As a 
result, scores for some seed responses slipped below our initial 
threshold of 4. Nonetheless, these responses were retained. 




























changing (“ohh I love that song”). Higher-valued 
positive-weighted mined responses are typically 
reasonably appropriate and relevant (even though 
extracted from a completely unrelated conversa¬ 
tion), and in some cases can outscore the original 
response, as can be seen in the third set of exam¬ 
ples. 


4.2 Human Evaluation of System Outputs 

Responses generated by the 7 systems used in this 
study on the 2114-triple test set were hand evalu¬ 
ated by 5 crowdsourced raters each on a 5-point 
Likert-type scale. From these 7 systems, 12 system 
pairs were evaluated, for a total of about pairwise 
126K ratings (12 • 5 • 2114). Here too, raters were 
asked to evaluate responses in terms of their rele¬ 
vance to both context and message. Outputs from 
different systems were randomly interleaved for 
presentation to the raters. We obtained human rat¬ 
ings on the following systems: 


Phrase-based MT: A phrase-based MT system 
similar to (Ritter et a l., 2011) , whose weights 
have been manually tuned. We also included 
four variants of that system, which we tuned with 
MERT (Och, 2003). These variants differ in their 


number of features, and augment (Ritter et al. 


2011 ) with the following phrase-level features: edit 


distance between source and target, cosine similar¬ 
ity, Jaccard index and distance, length ratio, and 
DSSM score (Huang et al., 2013). 

RNN-based MT: the log-probability according to 
the RNN model of ( jSordoni et al., 2015) ). 

Baseline: a random baseline. 

While ABleu relies on human qualitative judg¬ 
ments, it is important to note that human judgments 
on multi-references (§ |4. 1 [ > and those on system out¬ 
puts were collected completely independently. We 
also note that the set of systems listed above specif¬ 
ically does not include a retrieval-based model, as 
this might have introduced spurious correlation be¬ 


tween the two datasets (§ 4.1 and § 4.2 1 . 


5 Setup 

We use two rank correlation coefficients— 
Kendall’s r and Spearman’s p —to assess the level 
of correlation between human qualitative ratings 
(< 4.2) and automated metric scores. More formally, 
we compute each correlation coefficient on a series 
of paired observations (mi, q\), • • • , (mjv, q at). 
Here, m,; and qi are respectively differences in auto¬ 
matic metric scores and qualitative ratings for two 


given systems A and Bona given subset of the 
test setQvVhile much prior work assesses automatic 
metrics for MT and other tasks (Lavie and Agarwal, 
2007; Hodosh et al., 2013) by computing correla¬ 


tions on observations consisting of single-sentence 


system outputs, it has been shown (e.g., Przybocki 


et al. (2008)) that correlation coefficients signifi¬ 


cantly increase as observation units become larger. 
For instance, corpus-level or system-level correla¬ 
tions tend to be much higher than sentence-level 
correlations; |Graham and Baldwin (2014] ) show 
that Bleu is competitive with more recent and ad¬ 
vanced metrics when assessed at the system level j£] 
Therefore, we define our observation unit size to 
be M = 100 sentences (responses) j^] unless stated 
otherwise. We evaluate q r by averaging human rat¬ 
ings on the M sentences, and rri, by computing 
metric scores on the same set of sentences^] We 
compare three different metrics: Bleu, ABleu, 
and sentence-level Bleu (sBleu). The last com¬ 


putes sentence-level Bleu scores (Nakov et al. 


2012) and averages them on the M sentences (akin 


to macro-averaging). Finally, unless otherwise 
noted, all versions of B LEU use n-gram order up 
to 2 (Bleu-2), as this achieves better correlation 
for all metrics on this data. 


6 Results 

The main results of our study are shown in Table [2] 
ABleu achieves better correlation with human 
than Bleu, when comparing the best configura¬ 
tion of each metric^ In the case of Spearman’s p, 

4 For each given observation pair (m;, g;), we randomize 
the order in which A and B are presented to the raters in order 
to avoid any positional bias. 

5 We do not intend to minimize the benefit of a metric that 
would be competitive at the sentence-level, which would be 
particularly useful for detailed error analyses. However, our 
main goal is to reliably evaluate generation systems on test 
sets of thousands of sentences, in which case any metric with 
good corpus-level correlation (such as Bleu, as shown in 
^Graham and Baldwin, 2014}) would be sufficient. 

'’Enumerating all possible ways of assigning sentences to 
observations would cause a combinatorial explosion. Instead, 
for all our results we sample IK assignments and average 
correlations coefficients over them (using the same IK assign¬ 
ments across all metrics). These assignments are done in such 
a way that all sentences within an observation belong to the 
same system pair. 

7 We refrained from using larger units, as creating larger 
observation units M reduces the total number of units N. This 
would have caused confidence intervals to be so wide as to 
make this study inconclusive. 

s This is also the case on single reference. While ABleu 
and Bleu would have the same correlation if original refer¬ 
ences all had the same score of 1, it is not unusual for original 
references to get ratings below 1. 































Metric 

refs. 

Spearman’s p 

Kendall’s r 

Bleu 

Bleu 

Bleu 

single 
w > 0.6 

all 

.260 (.178, .337) 
.343 (.265, .416) 
.318 (.239, .392) 

.171 (.087, .252) 
.232 (.150, .312) 
.212 (.129, .292) 

sBleu 

sBleu 

sBleu 

single 

w > 0.6 

all 

.265 (.183, .342) 
.330 (.252, .404) 
.258 (.177, .336) 

.175 (.091, .256) 
.222 (.140, .302) 
.167 (.083, .249) 

ABleu 

ABleu 

ABleu 

single 

w > 0.6 

all 

.280 (.199,.357) 
.405 (.331,.474) 
.484 (.415,.546) 

.187 (.103, .268) 
.281 (.200, .357) 
.342 (.265, .415) 



Table 2: Human correlations for IBM Bleu, sentence-level 
Bleu, and ABleu with 95% confidence intervals. This 
compares 3 types of references: single only, high scoring 
references (w > 0.6), and all references. 


the confidence intervals of Bleu (.265, .416) and 
ABleu (.415, .546) barely overlap, while interval 
overlap is more significant in the case of Kendall’s 
r. Correlation coefficients degrade for Bleu as 
we go from w > 0.6 to using all references. This 
is expected, since Bleu treats all references as 
equal and has no way of discriminating between 
them. On the other hand, correlation coefficients 
increase for ABleu after adding lower scoring ref¬ 
erences. It is also worth noticing that Bleu and 
sBleu obtain roughly comparable correlation co¬ 
efficients. This may come as a surprise, because it 
has been suggested elsewhere that sBleu has much 
worse correlation than Bleu computed at the cor¬ 
pus level (Przybocki et ah, 2008). We surmise that 
(at least for this task and data) the differences in 
correlations between Bleu and sBleu observed 
in prior work may be less the result of a difference 
between micro- and macro-averaging than they are 
the effect of different observation unit sizes (as 
discussed in )|5]). 

Finally, Figure [2] shows how Spearman’s p is 
affected along three dimensions of study. In par¬ 
ticular, we see that ABleu actually benefits from 
the references with negative ratings. While the im¬ 
provement is not pronounced, we note that most ref¬ 
erences have positive ratings. Negatively-weighted 
references could have a greater effect if, for exam¬ 
ple, randomly extracted responses had also been 
annotated. 



(B) correlations with different observation unit sizes (M) 
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Figure 2: A comparison of Bleu, sentence-level Bleu, and 
ABleu along three dimensions: (A) decreasing the threshold 
on reference scores Wij m , (B) increasing the unit size for the 
correlation study from a single sentence (M= 1) to a size of 
100; (C) going from Bleu- 1 to Bleu- 4 for the different 
versions of Bleu. 


velopment]^] An upfront cost is paid for human 
evaluation of the reference set, but following that, 
the need for further human evaluation can be min¬ 
imized during system development. AB LEU may 
help other tasks that use multiple references for 
intrinsic evaluation, including image-to-text, sen¬ 
tence compression, and paraphrase generation, and 
even statistical machine translation. Evaluation of 
ABleu in these tasks awaits future work. 
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7 Conclusions 

ABleu correlates well with human quality judg¬ 
ments of generated conversational responses, out¬ 
performing both IBM Bleu and sentence-level 
Bleu in this task and demonstrating that it can 
serve as a plausible intrinsic metric for system de- 
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